V18 Manifold-Guided Architecture — val_bpb 0.434 #663
raahilg wants to merge 2 commits into openai:main from
Conversation
Rephrase sentence for clarity regarding model initialization.
Pull request overview
Adds a new Parameter Golf submission (“V18 Manifold-Guided Architecture + Sparsemax Routing”) including the training script, run logs for two seeds, and the submission metadata/README describing results.
Changes:
- Introduces a new `train_gpt.py` implementing manifold construction + sparsemax-routed multi-hop message passing + manifold-guided attention, with int8+zlib export and roundtrip eval.
- Adds training logs for seed 42 and seed 27 runs (including post-quant BPB).
- Adds `submission.json` and a README documenting the approach and reported metrics.
Reviewed changes
Copilot reviewed 3 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-24_V18_ManifoldGuided_Sparsemax/train_gpt.py | New training/manifold/quantization script for the V18 submission. |
| records/track_10min_16mb/2026-03-24_V18_ManifoldGuided_Sparsemax/train_seed42.log | Seed 42 training record and final int8+zlib roundtrip metrics. |
| records/track_10min_16mb/2026-03-24_V18_ManifoldGuided_Sparsemax/train_seed27.log | Seed 27 training record and final int8+zlib roundtrip metrics. |
| records/track_10min_16mb/2026-03-24_V18_ManifoldGuided_Sparsemax/submission.json | Submission metadata (score, size, blurb, author/date). |
| records/track_10min_16mb/2026-03-24_V18_ManifoldGuided_Sparsemax/README.md | Write-up of the method, results table, and run instructions. |
```python
if not files:
    raise FileNotFoundError(f"No files found for pattern: {pattern}")
tokens = torch.cat([load_data_shard(file) for file in files]).contiguous()
usable = ((tokens.numel() - 1) // seq_len) * seq_len
```
`load_validation_tokens` can return a too-short tensor when the validation split has fewer than `seq_len + 1` tokens (or when `seq_len` is set too large). That leads to `total_seqs == 0` in `eval_val` and a divide-by-zero when computing `val_loss`/`val_bpb`. Add the same `usable <= 0` guard the baseline scripts use and raise a clear `ValueError` when the validation set is too short for the configured sequence length.
Suggested change:

```diff
 usable = ((tokens.numel() - 1) // seq_len) * seq_len
+if usable <= 0:
+    raise ValueError(
+        f"Validation set too short for seq_len={seq_len}: "
+        f"only {tokens.numel()} tokens available."
+    )
```
```python
# Physics simulation (runs on rank 0's GPU, broadcast result)
torch.cuda.empty_cache()
log0(f"\n  Physics simulation ({args.physics_dim}D, {args.physics_steps} steps)...")
torch.manual_seed(args.seed)
torch.cuda.manual_seed(args.seed)
pos = torch.nn.Parameter(torch.randn(V, args.physics_dim, device=device) * 0.1)
opt_sim = torch.optim.Adam([pos], lr=0.05)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt_sim, args.physics_steps, eta_min=0.0005)

src_t = torch.tensor(np.concatenate([rows, cols]), dtype=torch.long, device=device)
dst_t = torch.tensor(np.concatenate([cols, rows]), dtype=torch.long, device=device)
sw_t = torch.tensor(np.concatenate([spring_w, spring_w]), dtype=torch.float32, device=device)
mass_t = torch.tensor(entropic_mass, dtype=torch.float32, device=device)
asym_t = torch.tensor(asymmetry, dtype=torch.float32, device=device)
dsrc_t = torch.tensor(np.concatenate([dir_rows, dir_cols]), dtype=torch.long, device=device)
ddst_t = torch.tensor(np.concatenate([dir_cols, dir_rows]), dtype=torch.long, device=device)
dw_t = torch.tensor(np.concatenate([dir_w_vals, dir_w_vals]), dtype=torch.float32, device=device)
n_rep = min(80000, V*(V-1)//2)
sf = (V*(V-1)//2) / n_rep
# CPU RNG for deterministic physics across different GPU hardware
phys_rng = torch.Generator()
phys_rng.manual_seed(12372)
t0 = time.time()

for step in range(args.physics_steps):
    opt_sim.zero_grad()
    n_ss = min(200000, len(src_t))
    si = torch.randint(0, len(src_t), (n_ss,), generator=phys_rng).to(device)
    d = pos[src_t[si]] - pos[dst_t[si]]
    E_spring = (len(src_t)/n_ss) * torch.sum(sw_t[si] * torch.sum(d**2, dim=1))

    ri = torch.randint(0, V, (n_rep,), generator=phys_rng).to(device)
    rj = torch.randint(0, V-1, (n_rep,), generator=phys_rng).to(device)
    rj = rj + (rj >= ri).long()
    E_rep = sf * torch.sum(1.0 / torch.norm(pos[ri]-pos[rj], dim=1).clamp(min=1e-4))

    ai_idx = torch.where(asym_t > asym_t.median())[0]
    n_ap = min(2000, len(ai_idx)*(len(ai_idx)-1)//2)
    if n_ap > 0 and len(ai_idx) > 1:
        ai = ai_idx[torch.randint(0, len(ai_idx), (n_ap,), generator=phys_rng).to(device)]
        aj = ai_idx[torch.randint(0, len(ai_idx), (n_ap,), generator=phys_rng).to(device)]
        mk = ai != aj; ai, aj = ai[mk], aj[mk]
        E_torsion = 0.5 * torch.sum(
            asym_t[ai]*asym_t[aj] / torch.norm(pos[ai]-pos[aj], dim=1).clamp(min=1e-4)
        ) if len(ai) > 0 else torch.tensor(0.0, device=device)
    else:
        E_torsion = torch.tensor(0.0, device=device)

    gi = torch.randint(0, V, (n_rep,), generator=phys_rng).to(device)
    gj = torch.randint(0, V-1, (n_rep,), generator=phys_rng).to(device)
    gj = gj + (gj >= gi).long()
    E_grav = -sf * 0.1 * torch.sum(
        mass_t[gi]*mass_t[gj] / torch.norm(pos[gi]-pos[gj], dim=1).clamp(min=1e-4))

    if len(dsrc_t) > 0:
        n_ds = min(100000, len(dsrc_t))
        di = torch.randint(0, len(dsrc_t), (n_ds,), generator=phys_rng).to(device)
        dd = pos[dsrc_t[di]] - pos[ddst_t[di]]
        E_dir = 0.3 * (len(dsrc_t)/n_ds) * torch.sum(dw_t[di] * torch.sum(dd**2, dim=1))
    else:
        E_dir = torch.tensor(0.0, device=device)

    (E_spring + E_rep + E_torsion + E_grav + E_dir).backward()
    torch.nn.utils.clip_grad_norm_([pos], 10.0)
    opt_sim.step(); sched.step()
    if step % 1000 == 0:
        log0(f"    physics step {step} ({time.time()-t0:.0f}s)")

positions = pos.detach().cpu().numpy()
del pos, opt_sim, sched
torch.cuda.empty_cache()
log0(f"  Physics done ({time.time()-t0:.0f}s)")

# Hessian eigendecomposition
log0(f"  Computing Hessian...")
coupling = np.zeros((V, V), dtype=np.float32)
for k in range(len(rows)):
    w = float(spring_w[k])
    coupling[rows[k], cols[k]] += 2 * w
    coupling[cols[k], rows[k]] += 2 * w
coupling += np.outer(entropic_mass, entropic_mass) * 0.1
for k in range(len(dir_rows)):
    v = float(dir_w_vals[k]) * 0.3
    coupling[dir_rows[k], dir_cols[k]] += v
    coupling[dir_cols[k], dir_rows[k]] += v
chunk = 256
for i in range(0, V, chunk):
    ie = min(i+chunk, V)
    diff = positions[i:ie, None, :] - positions[None, :, :]
    d = np.linalg.norm(diff, axis=-1)
    d = np.maximum(d, 1e-8)
    coupling[i:ie] += (1.0 / (d**3)).astype(np.float32)
np.fill_diagonal(coupling, 0)

coupling_t = torch.from_numpy(coupling.astype(np.float64))
evals_all, evecs_all = torch.linalg.eigh(coupling_t)
idx_ = torch.argsort(evals_all, descending=True)[:args.hessian_modes]
evals = evals_all[idx_].numpy()
evecs = evecs_all[:, idx_].numpy()
# Fix eigh sign ambiguity — make largest element in each column positive
for i in range(evecs.shape[1]):
    if evecs[np.argmax(np.abs(evecs[:, i])), i] < 0:
        evecs[:, i] *= -1
hessian_coords = (evecs * np.sqrt(np.abs(evals))[None, :]).astype(np.float32)
log0(f"  Hessian: {hessian_coords.shape}, eigenvalues: {evals[0]:.2f} → {evals[-1]:.2f}")

dir_scale = 0.5 * np.std(hessian_coords) / (np.std(directional_coords) + 1e-8)
syn_scale = 0.3 * np.std(hessian_coords) / (np.std(syntactic_coords) + 1e-8)
combined = np.concatenate([
    hessian_coords,
    directional_coords[:, -32:] * dir_scale,
    syntactic_coords[:, -32:] * syn_scale,
], axis=1).astype(np.float32)

log0(f"  Manifold ready: {combined.shape} ({time.time()-t_total:.0f}s total)")
return combined
```
`build_manifold_distributed` says the physics simulation "runs on rank 0's GPU, broadcast result", but the code currently runs the physics simulation and Hessian eigendecomposition on every rank. In multi-GPU runs this duplicates the most expensive work and can blow the wallclock budget. Consider gating the physics/Hessian section with `if rank == 0`, then broadcasting the resulting combined manifold coordinates (e.g., via `dist.broadcast` on a tensor) to the other ranks.
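A minimal sketch of the gating pattern this comment suggests. The helper below is hypothetical (the real script would replace the placeholder with its existing physics/Hessian code); it assumes all ranks already agree on `V` and the manifold width so non-zero ranks can allocate a matching receive buffer for `dist.broadcast`:

```python
import numpy as np
import torch
import torch.distributed as dist


def build_manifold_rank0_only(rank: int, world_size: int, V: int, dim: int,
                              device: str = "cpu") -> torch.Tensor:
    """Run the expensive manifold build only on rank 0, then broadcast it.

    The random matrix below is a stand-in for the physics simulation +
    Hessian eigendecomposition; only its (V, dim) float32 shape matters here.
    """
    if rank == 0:
        # Rank 0 does the real work (placeholder shown).
        combined = np.random.randn(V, dim).astype(np.float32)
        buf = torch.from_numpy(combined).to(device)
    else:
        # Other ranks allocate an empty buffer of the agreed-upon shape.
        buf = torch.empty(V, dim, dtype=torch.float32, device=device)
    if world_size > 1:
        # Fill every non-zero rank's buffer with rank 0's result in place.
        dist.broadcast(buf, src=0)
    return buf
```

With `world_size == 1` the broadcast is skipped entirely, so single-GPU runs keep their current behavior.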
```python
import subprocess
import sys
```
There are unused imports (`subprocess`, `sys`) in this file. Removing them reduces lint noise and avoids implying there are subprocess/system side effects.
Suggested change:

```diff
-import subprocess
-import sys
```
```python
import torch.distributed as dist
import torch.nn.functional as F
from torch import Tensor, nn
from torch.nn.parallel import DistributedDataParallel as DDP
```
`DistributedDataParallel as DDP` is imported but not used (the script implements manual reduce/broadcast instead). Dropping the unused import will avoid confusion about whether DDP is expected here.
Suggested change:

```diff
-from torch.nn.parallel import DistributedDataParallel as DDP
```
```python
import os
```
`os` is imported twice (near the top and again here). Removing the duplicate import keeps the module header tidy.
Suggested change:

```diff
-import os
```
```diff
@@ -0,0 +1,60 @@
+# V18 Manifold-Guided Architecture — val_bpb: 0.438
```
The README title reports `val_bpb: 0.438`, while `submission.json` reports `val_bpb: 0.4343`. If 0.438 is meant to be the mean across seeds, consider updating the title to say "mean val_bpb" (or update it to the best/official score) to avoid ambiguity.
Suggested change:

```diff
-# V18 Manifold-Guided Architecture — val_bpb: 0.438
+# V18 Manifold-Guided Architecture — mean val_bpb: 0.438
```
Standard language models must simultaneously construct an internal representation of token relationships and learn to navigate that representation to make predictions. We separate these two jobs.
By precomputing a physics-simulated token manifold from corpus co-occurrence statistics, we freeze the geometric structure directly into the architecture. The model's job changes from construction + navigation to just navigation — a much easier task that lets the weights specialize entirely on exploiting the geometric prior rather than building it from scratch.
The result is essentially a GNN operating on a precomputed token interaction graph — the manifold defines graph topology, sparsemax produces edge weights, and hop cells perform node updates with message passing. Every architecture decision is chosen to exploit this geometric prior: sparsemax routing along manifold geodesics, spectral-coordinate-conditioned attention, entropy-guided message passing, and parallel transport across the token manifold.
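For reference, a minimal sketch of the sparsemax transform (Martins & Astudillo, 2016) named above, which projects scores onto the probability simplex and, unlike softmax, assigns exact zeros to low-scoring entries; the PR's actual routing code may differ in details:

```python
import torch


def sparsemax(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Euclidean projection of z onto the probability simplex.

    Entries below the support threshold tau come out exactly zero,
    which is what makes sparsemax useful for pruning graph edges.
    """
    z_sorted, _ = torch.sort(z, dim=dim, descending=True)
    k = torch.arange(1, z.size(dim) + 1, device=z.device, dtype=z.dtype)
    shape = [1] * z.dim()
    shape[dim] = -1
    k = k.view(shape)
    z_cumsum = z_sorted.cumsum(dim) - 1
    # Support size: largest k with k * z_(k) > cumsum(z_(1..k)) - 1.
    support = (k * z_sorted > z_cumsum).to(z.dtype)
    k_max = support.sum(dim=dim, keepdim=True)
    tau = z_cumsum.gather(dim, k_max.long() - 1) / k_max
    return torch.clamp(z - tau, min=0.0)
```

For scores `[2.0, 1.0, 0.1]` this yields `[1.0, 0.0, 0.0]`: the two weak candidates are pruned outright rather than receiving small softmax mass.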
With a vocabulary of only 1024 tokens, the full pairwise statistics are trivially computable — the manifold captures essentially the complete statistical structure of the language. Normally, a model would need to rediscover these patterns through gradient descent; we hand them to a 20M-parameter model at initialization.
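To illustrate why the full pairwise statistics are cheap at this scale, here is a hedged sketch (the function name and windowed-count scheme are illustrative, not the PR's actual statistics) of a dense co-occurrence matrix over a token stream; at V = 1024 the V x V float32 matrix is only ~4 MB:

```python
import numpy as np


def cooccurrence_matrix(tokens: np.ndarray, V: int, window: int = 2) -> np.ndarray:
    """Count symmetric co-occurrences within a sliding window.

    Uses np.add.at so repeated (i, j) pairs accumulate correctly
    even when the same index appears multiple times in one batch.
    """
    counts = np.zeros((V, V), dtype=np.float32)
    for offset in range(1, window + 1):
        a, b = tokens[:-offset], tokens[offset:]
        np.add.at(counts, (a, b), 1.0)  # count i -> j at this offset
        np.add.at(counts, (b, a), 1.0)  # symmetrize
    return counts
```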